Abstract

We prove that the simplex method with the highest gain/most-negative-reduced cost pivoting 
rule converges in strongly polynomial time for deterministic Markov decision processes (MDPs) 
regardless of the discount factor. For a deterministic MDP with $n$ states and $m$ actions, we
prove the simplex method runs in $O(n^3 m^2 \log^2 n)$ iterations if the discount factor is uniform and
$O(n^5 m^3 \log^2 n)$ iterations if each action has a distinct discount factor. Previously the simplex
method was known to run in polynomial time only for discounted MDPs where the discount was
bounded away from 1 [Ye11].

Unlike in the discounted case, the algorithm does not greedily converge to the optimum, and 
we require a more complex measure of progress. We identify a set of layers in which the values of 
primal variables must lie and show that the simplex method always makes progress optimizing 
one layer, and when the upper layer is updated the algorithm makes a substantial amount of 
progress. In the case of nonuniform discounts, we define a polynomial number of "milestone" 
policies and we prove that, while the objective function may not improve substantially overall, 
the value of at least one dual variable is always making progress towards some milestone, and 
the algorithm will reach the next milestone in a polynomial number of steps. 



1 Introduction 

Markov decision processes (MDPs) are a powerful tool for modeling repeated decision making in 
stochastic, dynamic environments. An MDP consists of a set of states and a set of actions that one 
may perform in each state. Based on an agent's actions it receives rewards and affects the future
evolution of the process, and the agent attempts to maximize its rewards over time (see Section 2
for a formal definition). MDPs are widely used in machine learning, robotics and control, operations
research, economics, and related fields. See the books [Put94] and [Ber96] for a thorough overview.

Solving MDPs is also an important problem theoretically. Optimizing an MDP can be formulated 
as a linear program (LP), and although these LPs possess extra structure that can be exploited by 
algorithms like Howard's policy iteration method [How60], they lie just beyond the point at which
our ability to solve LPs in strongly polynomial time ends (and are a natural target for extending
this ability), and they have proven to be hard in general for algorithms previously thought to be
quite powerful, such as randomized simplex pivoting rules [FHZ11].
In practice [LDK95] MDPs are solved using policy iteration, which may be viewed as a parallel
version of the simplex method with multiple simultaneous pivots, or value iteration [Bel57], an
inexact approximation to policy iteration that is faster per iteration. If the discount factor $\gamma$, which
determines the effective time horizon (see Section 2), is small it has long been known that policy
and value iteration will find an $\epsilon$-approximation to the optimum [Bel57]. It is also well-known that
value iteration may be exponential, but policy iteration resisted worst-case analysis for many years.
It was conjectured to be strongly polynomial but except for highly-restricted examples [Mad02]
only exponential time bounds were known [MS99]. Building on results for parity games [Fri09],
Fearnley recently gave an exponential lower bound [Fea10]. Friedmann, Hansen, and Zwick extended
Fearnley's techniques to achieve sub-exponential lower bounds for randomized simplex pivoting
rules [FHZ11] using MDPs, and Friedmann gave an exponential lower bound for MDPs using the
least-entered pivoting rule [Fri11]. Melekopoglou and Condon proved several other simplex pivoting
rules are exponential [MC94].

On the positive side, Ye designed a specialized interior-point method that is strongly polynomial
in everything except the discount factor [Ye05]. Ye later proved that for discounted MDPs with $n$
states and $m$ actions, the simplex method with the most-negative-reduced-cost pivoting rule and, by
extension, policy iteration, run in time $O\left(\frac{nm}{1-\gamma} \log \frac{n}{1-\gamma}\right)$, which is
polynomial for fixed $\gamma$ [Ye11]. Hansen, Miltersen, and Zwick improved the policy iteration bound to
$O\left(\frac{m}{1-\gamma} \log \frac{n}{1-\gamma}\right)$ and extended it to both value iteration and the strategy iteration
algorithm for two-player turn-based stochastic games [HMZ11].

But the performance of policy iteration and simplex-style basis-exchange algorithms on MDPs 
remains poorly understood. Policy iteration, for instance, is conjectured to run in $O(m)$ iterations
on deterministic MDPs, but the best upper bounds are exponential, although a lower bound of
$\Omega(m)$ is known [HZ10]. Improving our understanding of these algorithms is an important step
in designing better ones with polynomial or even strongly polynomial guarantees.

Motivated by these questions, we analyze the simplex method with the most-negative-reduced-cost
pivoting rule on deterministic MDPs. For a deterministic MDP with $n$ states and $m$ actions,
we prove that the simplex method terminates in $O(n^3 m^2 \log^2 n)$ iterations regardless of the discount
factor, and if each action has a distinct discount factor, then the algorithm runs in $O(n^5 m^3 \log^2 n)$
iterations. Our results do not extend to policy iteration, and we leave this as a challenging open
question.

Although deterministic MDPs were previously known to be solvable in strongly polynomial time
using specialized methods not applicable to general MDPs (minimum mean cycle algorithms [PT87]
or, in the case of nonuniform discounts, exploiting the property that the dual LP has only two
variables per inequality [HN94]), they were not known to be solvable in polynomial time with the
more generic simplex method. More generally, we believe that our results help shed some light on
how algorithms like simplex and policy iteration function on MDPs.

Our proof techniques, particularly in the case of nonuniform discounts, may be of independent 
interest. For uniformly discounted MDPs, we show that the values of the primal flux variables must 
lie within one of two intervals or layers of polynomial size depending on whether an action is on 
a path or a cycle. Most iterations update variables in the smaller path layer, and we show these 
converge rapidly to a locally optimal policy for the paths, at which point the algorithm must update 
the larger cycle layer and makes a large amount of progress towards the optimum. Progress takes 
the form of many small improvements interspersed with a few much larger ones rather than uniform 
convergence. 
The nonuniform case is harder, and our measure of progress is unusual and, to the best of
our knowledge, novel. We again define a set of intervals in which the value of variables on cycles 
must fall, and these define a collection of intermediate milestone or checkpoint values for each dual 
variable (the value of a state in the MDP). Whenever a variable enters a cycle layer, we argue that 
a corresponding dual variable is making progress towards the layer's milestone and will pass this 
value after enough updates. When each of these checkpoints has been passed, the algorithm must
have reached the optimum. 

In Section 2 we formally define MDPs and describe a number of well-known properties that we
require. In Section 3 we analyze the case of a uniform discount factor, and in Section 4 we extend
these results to the nonuniform case.

2 Preliminaries 

Many variations and extensions of MDPs have been defined, but we will study the following problem. 
A Markov decision process consists of a set of $n$ states $S$ and $m$ actions $A$. Each action $a$ is associated
with a single state $s$ in which it can be performed, a reward $r_a \in \mathbb{R}$ for performing the action, and
a probability distribution $P_a$ over states to which the process will transition when using action $a$.
There is at least one action usable in each state. Let $r$ be the vector of rewards indexed by $a$ with
entries $r_a$, $A_s \subseteq A$ be the set of actions performable in state $s$, and $P$ be the $n$ by $m$ matrix with
columns $P_a$. We will restrict the distributions $P_a$ to be deterministic for all actions, in which case
states may be thought of as nodes in a graph and actions as directed edges. However, the results in
this section apply to MDPs with stochastic transitions as well.

At each time step, the MDP starts in some state $s$ and performs an action $a$ admissible in
state $s$, at which point it receives the reward $r_a$ and transitions to a new state $s'$ according to the
probability distribution $P_a$. We are given a discount factor $\gamma < 1$ as part of the input, and our goal
is to choose actions to perform so as to maximize the expected discounted reward we accumulate
over an infinite time horizon. The discount can be thought of as a stopping probability: at each
time step the process ends with probability $1 - \gamma$.

Due to the Markov property (transitions depend only on the current state and action) there is an
optimal strategy that is memoryless and depends only on the current state. Let $\pi$ be such a policy,
a distribution of actions to perform in each state. This defines a Markov chain and a value for each
state:

Definition 2.1. Let $\pi$ be a policy, $P^\pi$ be the $n$ by $n$ matrix where $P^\pi_{s,s'}$ is the probability of
transitioning from $s'$ to $s$ using $\pi$, and $r_\pi$ the vector of expected rewards for each state according
to the distribution of actions in $\pi$. The value vector $v^\pi$ is indexed by states, and $v^\pi_s$ is equal to
the expected total discounted reward of starting in state $s$ and following policy $\pi$. It is defined as
$v^\pi = \sum_{i \ge 0} (\gamma (P^\pi)^T)^i r_\pi = (I - \gamma (P^\pi)^T)^{-1} r_\pi$, or equivalently by

$v^\pi = r_\pi + \gamma (P^\pi)^T v^\pi$. (1)
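As a concrete illustration of Equation (1), the sketch below solves the linear system for a pure policy on a deterministic MDP using exact rational arithmetic. The function name `policy_values` and the list encoding (`next_state[s]` and `reward[s]` describe the single action the policy uses in state `s`) are assumptions made for this example, not notation from the paper.

```python
from fractions import Fraction

def policy_values(next_state, reward, gamma):
    """Solve v = r + gamma * v[next_state] for a deterministic pure policy.

    next_state[s], reward[s]: successor state and reward of the action used in s
    (a hypothetical encoding chosen for this sketch).
    """
    n = len(next_state)
    # Augmented matrix for (I - gamma * (P^pi)^T) v = r.
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for s in range(n):
        rows[s][s] += 1
        rows[s][next_state[s]] -= gamma
        rows[s][n] = Fraction(reward[s])
    # Gauss-Jordan elimination; the system is nonsingular when gamma < 1.
    for col in range(n):
        piv = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[piv] = rows[piv], rows[col]
        p = rows[col][col]
        rows[col] = [x / p for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[col])]
    return [rows[s][n] for s in range(n)]

# State 0 moves to state 1 with reward 1; state 1 loops on itself with reward 2.
values = policy_values([1, 1], [1, 2], Fraction(1, 2))
```

In this two-state example the self-loop at state 1 is worth $2/(1-\gamma) = 4$ at $\gamma = 1/2$, and state 0 is worth $1 + \gamma \cdot 4 = 3$.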

If policy $\pi$ is randomized and uses two or more actions in some state $s$, then the value of $s$ is
an average of the values of performing each of the pure actions in $s$, and one of these is the largest.
Therefore we can replace the distribution by a single action and only increase the value of the state.
In the remainder of the paper we will restrict ourselves to pure policies in which a single action is
taken in each state.






In addition to the value vector, a policy $\pi$ also has an associated flux vector $x^\pi$ that will play a
critical role in our analysis. Suppose we start with a single unit of "mass" on every state and then
run the Markov chain. At each time step we remove a $1 - \gamma$ fraction of the mass on each state and
redistribute the remaining mass according to the policy $\pi$. Summing over all time steps, the total
amount of mass that passes through each action is its flux. More formally,

Definition 2.2. Let $\pi$ be a policy and $P^\pi$ the $n$ by $n$ transition matrix for $\pi$ formed by the columns
$P_a$ for actions in $\pi$. The flux vector $x^\pi$ is indexed by actions. If action $a$ is not in $\pi$ then $x^\pi_a = 0$,
and if $\pi$ uses $a$ in state $s$, then $x^\pi_a = y_s$, where

$y = \sum_{i \ge 0} (\gamma P^\pi)^i \mathbf{1} = (I - \gamma P^\pi)^{-1} \mathbf{1}$, (2)

and $\mathbf{1}$ is the all ones vector of dimension $n$. The flux is the total discounted number of times we use
each action if we start the MDP in all states and run the Markov chain $P^\pi$ discounting by $\gamma$ each
iteration.
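The "mass" description above translates directly into a truncated version of the series in Equation (2). The following is a minimal floating-point sketch for a deterministic pure policy; the name `flux`, the `next_state` encoding, and the fixed number of `steps` are assumptions for illustration only.

```python
def flux(next_state, gamma, steps=2000):
    """Approximate y = sum_i (gamma * P^pi)^i 1 by pushing mass along the policy.

    next_state[s] is the successor of the action the policy uses in state s,
    so the returned y[s] is the flux through that action.
    """
    n = len(next_state)
    mass = [1.0] * n          # one unit of mass starts on every state
    total = [0.0] * n
    for _ in range(steps):
        for s in range(n):
            total[s] += mass[s]          # mass passing through the action at s
        moved = [0.0] * n
        for s in range(n):
            moved[next_state[s]] += gamma * mass[s]   # discount, then move
        mass = moved
    return total

# State 0 moves to state 1; state 1 loops on itself.
y = flux([1, 1], 0.5)
```

Here state 0 is only ever visited by its own initial unit, so its flux is 1, while the self-loop at state 1 collects the rest of the total $n/(1-\gamma) = 4$ units of flux.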

Note that if $a \in \pi$ then $x^\pi_a \ge 1$, since the initial flux placed on $a$'s state always passes through $a$.
Further note that each bit of flux can be traced back to one of the initial units of mass placed on
each state, although the vector $x^\pi$ sums flux from all states. This will be important in Section 4.

Solving the MDP can be formulated as the following primal/dual pair of LPs, in which the flux 
and value vectors correspond to primal and (possibly infeasible) dual solutions: 

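Primal:

maximize $\sum_a r_a x_a$
subject to $\forall s \in S$, $\sum_{a \in A_s} x_a = 1 + \gamma \sum_a P_{a,s} x_a$ (3)
$x \ge 0$

Dual:

minimize $\sum_s v_s$
subject to $\forall s \in S, a \in A_s$, $v_s \ge r_a + \gamma \sum_{s'} P_{a,s'} v_{s'}$ (4)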

The constraint matrix of (3) is equal to $M - \gamma P$, where $M_{s,a} = 1$ if action $a$ can be used in
state $s$ and 0 otherwise. Vertices of the primal polytope represent policies:

Lemma 2.3. There is a bijection between vertices of the polytope (3) and policies of the MDP.

Proof. Policies have exactly $n$ nonzero variables, and solving for the flux vector in (2) is identical
to solving for a basis in the polytope, so policies map to vertices. Write the constraints in (3) in
the standard matrix form $\mathbf{A}x = b$. The vector $b$ is $\mathbf{1}$, and $\mathbf{A} = M - \gamma P$. In a row $s$ of $\mathbf{A}$ the only
positive entries are on actions usable in state $s$, so if $\mathbf{A}x = b$, then $x$ must have a nonzero entry for
every state, i.e., a choice of action for every state. Bases of the LP have $n$ variables, so they must
include only one action per state. □



By Lemma 2.3, the simplex method applied to (3) corresponds to a simple, single-switch version
of policy iteration: we start with an arbitrary policy, and in each iteration we change a single action
that improves the value of some state. The LP is not degenerate, since, as shown above, $x^\pi_a \ge 1$ for
all $a$ in the basis. Therefore the simplex method will find the optimal policy with no cycling. We
will use Dantzig's most-negative-reduced-cost pivoting rule to choose the action switched. Since (3)
is written as a maximization problem, we will refer to reduced costs as gains and always choose the
highest gain action to switch/pivot. For MDPs, the gains have a simple interpretation:
Definition 2.4. The gain (or reduced cost) of an action $a$ for state $s$ with respect to a policy $\pi$ is
denoted $r^\pi_a$ and is the improvement in the value of $s$ if $s$ uses action $a$ once and then follows $\pi$ for
all time. Formally, $r^\pi_a = (r_a + \gamma P_a^T v^\pi) - v^\pi_s$, or, in vector form

$r^\pi = r - (M - \gamma P)^T v^\pi$. (5)
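The single-switch procedure with the highest gain rule can be sketched in a few lines for a deterministic MDP. This is an illustrative sketch, not the authors' implementation: the encoding of actions as `(state, next_state, reward)` triples, the function names, and the use of exact rationals are all assumptions made here. Gains are computed as in Definition 2.4.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact rationals; A is assumed nonsingular."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def policy_values(policy, actions, gamma):
    """Values of the pure policy; policy[s] is the index of the action used in s."""
    n = len(policy)
    A = [[Fraction(0)] * n for _ in range(n)]
    b = []
    for s in range(n):
        _, t, r = actions[policy[s]]
        A[s][s] += 1
        A[s][t] -= gamma
        b.append(Fraction(r))
    return solve(A, b)

def highest_gain_simplex(actions, policy, gamma):
    """Repeatedly switch the single action with the largest positive gain."""
    policy = list(policy)
    pivots = 0
    while True:
        v = policy_values(policy, actions, gamma)
        # Gain of action a = (s, t, r): r + gamma * v[t] - v[s].
        gains = [r + gamma * v[t] - v[s] for (s, t, r) in actions]
        best = max(range(len(actions)), key=lambda a: gains[a])
        if gains[best] <= 0:
            return policy, v, pivots      # no improving switch remains
        policy[actions[best][0]] = best   # pivot: use `best` in its state
        pivots += 1

# Two states; each has a self-loop and an edge to the other state.
actions = [(0, 0, 1), (0, 1, 1), (1, 1, 2), (1, 0, 0)]
final_policy, final_values, num_pivots = highest_gain_simplex(
    actions, [0, 3], Fraction(1, 2))
```

On this small instance the first pivot creates the self-loop at state 1 (a new cycle), and the second reroutes state 0 onto a path leading into that cycle.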

We denote the optimal policy by $\pi^*$, and the optimal flux, values, and gains by $x^*$, $v^*$, and $r^*$.
The following are basic properties of the simplex method, and we prove them for completeness.

Lemma 2.5. Let $\pi$ and $\pi'$ be any policies. The gains satisfy the following properties

• $(r^\pi)^T x^{\pi'} = r^T x^{\pi'} - r^T x^\pi = \mathbf{1}^T v^{\pi'} - \mathbf{1}^T v^\pi$,

• $r^\pi_a = 0$ for all $a \in \pi$, and

• $r^*_a \le 0$ for all $a$.

Proof. From the definition of gains $(r^\pi)^T x^{\pi'} = (r - (M - \gamma P)^T v^\pi)^T x^{\pi'} = r^T x^{\pi'} - (v^\pi)^T (M - \gamma P) x^{\pi'} = r^T x^{\pi'} - (v^\pi)^T \mathbf{1}$,
using that $(M - \gamma P)$ is the constraint matrix of (3). From the definition
of value and flux vectors $r^T x^\pi = r_\pi^T (I - \gamma P^\pi)^{-1} \mathbf{1} = (v^\pi)^T \mathbf{1}$, where $r_\pi$ is the reward vector restricted
to indices $\pi$. Combining these two gives the first result.

For the second result, if $a$ is in $\pi$, then $v^\pi_s = r_a + \gamma P_a^T v^\pi$, so $r^\pi_a = 0$. Finally, if $r^*_a > 0$ for some
$a$, then consider the policy $\pi$ that is identical to $\pi^*$ but uses $a$. Then $(r^*)^T x^\pi > 0$, and the first
identity proves that $\pi^*$ is not optimal. □

A key property of the simplex method on MDPs that we will employ repeatedly is that not only
is the overall objective improving, but also the values of all states are monotone non-decreasing,
and there exists a single policy we denote by $\pi^*$ that maximizes the values of all states:

Lemma 2.6. Let $\pi$ and $\pi'$ be policies appearing in an execution of the simplex method with $\pi'$ being
used after $\pi$. Then $v^{\pi'} \ge v^\pi$. Further, let $\pi^*$ be the policy when simplex terminates, and $\pi''$ be any
other policy. Then $v^* \ge v^{\pi''}$.

Proof. Suppose $\pi$ and $\pi'$ are subsequent policies. The gains of all actions in $\pi'$ with respect to $\pi$
are equal to $r_{\pi'} - (I - \gamma P^{\pi'})^T v^\pi$, all of which are nonnegative. Therefore
$0 \le (I - \gamma (P^{\pi'})^T)^{-1} (r_{\pi'} - (I - \gamma P^{\pi'})^T v^\pi) = v^{\pi'} - v^\pi$,
using that $(I - \gamma (P^{\pi'})^T)^{-1} = \sum_{i \ge 0} (\gamma (P^{\pi'})^T)^i \ge 0$. By induction, this
holds if $\pi$ and $\pi'$ occur further apart. Performing a similar calculation using the gains $r^*$, which are
nonpositive, shows that $v^* - v^{\pi''} \ge 0$ for any policy $\pi''$. □



3 Uniform discount 

As a warmup before delving into our analysis of deterministic MDPs, we briefly review the analysis
of [Ye11] for stochastic MDPs with a fixed discount. Consider the flux vector in Definition 2.2. One
unit of flux is added to each state, and every step it is discounted by a factor of $\gamma$, for a total of
$n(1 + \gamma + \gamma^2 + \cdots) = n/(1 - \gamma)$ flux overall. If $\pi$ is the current policy and $\Delta$ is the highest gain,
then, by Lemma 2.5, the farthest $\pi$ can be from $\pi^*$ is if all $n/(1 - \gamma)$ units of flux in $\pi^*$ are on the
action with gain $\Delta$, so $r^T x^* - r^T x^\pi \le n\Delta/(1 - \gamma)$. If we pivot on this action, at least 1 unit of flux
is placed on the new action, increasing the objective by at least $\Delta$. Thus we have reduced the gap
to $\pi^*$ by a $(1 - \gamma)/n$ fraction, which is substantial if $1/(1 - \gamma)$ is polynomial. When the gap has
been reduced sufficiently, we are able to eliminate an action from all future policies. See [Ye11] for
the details.

The above result hinged on the fact that the size of all nonzero flux lay within the interval
$[1, n/(1 - \gamma)]$, which was assumed to be polynomial but gives a weak bound if $\gamma$ is very close to 1.
However, consider a policy for a deterministic MDP. It can be seen as a graph with a node for each
state with a single directed edge leaving each state representing the action, so the graph consists
of one or more directed cycles and directed paths leading to these cycles. Starting on a path, the
MDP uses each path action once before reaching a cycle, so the flux on paths must be small. Flux
on the cycles may be substantially larger, but since the MDP revisits each action after at most $n$
steps, the flux on cycle actions varies by at most a factor of $n$.

Lemma 3.1. Let $\pi$ be a policy with flux vector $x^\pi$ and $a$ an action in $\pi$. If $a$ is on a path in $\pi$ then
$1 \le x^\pi_a \le n$, and if $a$ is on a cycle then $1/(1 - \gamma) \le x^\pi_a \le n/(1 - \gamma)$. The total flux on paths is at
most $n^2$, and the total flux on cycles is at most $n/(1 - \gamma)$.

Proof. All actions have at least 1 flux. If $a$ is on a path, then starting from any state we can only
use $a$ once and never return, contributing flux at most 1 per state, so $x^\pi_a \le n$. Summing over all
path actions, the total flux is at most $n^2$.

If $a$ is on a cycle, each state on the cycle contributes a total of $1/(1 - \gamma)$ flux to the cycle. By
symmetry this flux is distributed evenly among actions on the cycle, so $x^\pi_a \ge 1/(1 - \gamma)$. The total
flux in the MDP is $n/(1 - \gamma)$, so $x^\pi_a \le n/(1 - \gamma)$. □
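The two layers of Lemma 3.1 can be checked numerically on a small policy graph. The sketch below is an illustration under assumed names (`flux`, `on_cycle`, `in_layers`) and an assumed `next_state` encoding of a deterministic pure policy; it approximates the flux by a truncated series rather than solving Equation (2) exactly.

```python
def flux(next_state, gamma, steps=5000):
    """Truncated series for the flux through the action used in each state."""
    n = len(next_state)
    mass, total = [1.0] * n, [0.0] * n
    for _ in range(steps):
        moved = [0.0] * n
        for s in range(n):
            total[s] += mass[s]
            moved[next_state[s]] += gamma * mass[s]
        mass = moved
    return total

def on_cycle(next_state):
    """on_cycle[s] is True when following the policy from s returns to s."""
    n = len(next_state)
    result = []
    for s in range(n):
        t = s
        for _ in range(n):
            t = next_state[t]
            if t == s:
                break
        result.append(t == s)
    return result

def in_layers(next_state, gamma, eps=1e-9):
    """Check path flux in [1, n] and cycle flux in [1/(1-gamma), n/(1-gamma)]."""
    n = len(next_state)
    x = flux(next_state, gamma)
    for s, cyc in enumerate(on_cycle(next_state)):
        lo, hi = (1 / (1 - gamma), n / (1 - gamma)) if cyc else (1, n)
        if not (lo - eps <= x[s] <= hi + eps):
            return False
    return True

# States 1 and 2 form a cycle; 3 -> 0 -> 1 is a path leading into it.
policy_graph = [1, 2, 1, 0]
x = flux(policy_graph, 0.9)
```

With $\gamma = 0.9$ the path actions carry flux 1 and 1.9, while the cycle actions carry 19 and 18.1, inside the cycle layer $[10, 40]$.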

The overall range of flux is large, but all values must lie within one of two polynomial layers. We 
will prove that simplex can essentially optimize each layer separately. If a cycle is not updated, then 
not much progress is made towards the optimum, but we make a substantial amount of progress in 
optimizing the paths for the current cycles. When the paths are optimal the algorithm is forced to 
update a cycle, at which point we make a substantial amount of progress towards the optimum but
reset all progress on the paths.

First we analyze progress on the paths: 

Lemma 3.2. Suppose the simplex method pivots from $\pi$ to $\pi'$, which does not create a new cycle.
Let $\pi''$ be the final policy such that cycles in $\pi''$ are a subset of those in $\pi$ (i.e., the final policy before
a new cycle is created). Then $r^T(x^{\pi''} - x^{\pi'}) \le (1 - 1/n^2)\, r^T(x^{\pi''} - x^\pi)$.

Proof. Let $\Delta = \max_a r^\pi_a$ be the highest gain. Consider $(r^\pi)^T x^{\pi''}$. Since cycles in $\pi''$ are contained
in $\pi$, $r^\pi_a = 0$ for any action $a$ on a cycle in $\pi''$, and by Lemma 3.1, $\pi''$ has at most $n^2$ units of flux
on paths, so $(r^\pi)^T x^{\pi''} = r^T(x^{\pi''} - x^\pi) \le n^2 \Delta$.

Policy $\pi'$ has at least 1 unit of flux on the action with gain $\Delta$, so

$r^T(x^{\pi''} - x^{\pi'}) \le r^T(x^{\pi''} - x^\pi) - \Delta \le \left(1 - \frac{1}{n^2}\right) r^T(x^{\pi''} - x^\pi)$. □

Due to the polynomial contraction in the lemma above, not too many iterations can pass before 
a new cycle is formed. 

Lemma 3.3. After $O(n^2 \log n)$ iterations either the algorithm finishes, a new cycle is created, a
cycle is broken, or some action never appears in a policy again before a new cycle is created.
Proof. Let $\pi$ be the policy in some iteration, $\pi'$ the last policy before a new cycle is created, and
$\pi''$ an arbitrary policy occurring between $\pi$ and $\pi'$ in the algorithm. Policy $\pi$ differs from $\pi'$ in
actions on paths and possibly in cycles that exist in $\pi$ but have been broken in $\pi'$. By Lemma 2.5
$-(r^{\pi'})^T x^\pi = r^T(x^{\pi'} - x^\pi) = \mathbf{1}^T(v^{\pi'} - v^\pi)$.

We divide the analysis into two cases. First suppose that there exists an action $a$ used in state
$s$ on a path such that $-r^{\pi'}_a x^\pi_a \ge -(r^{\pi'})^T x^\pi / n$ (note $(r^{\pi'})^T x^\pi \le 0$). Since $a$ is on a path $x^\pi_a \le n$,
which implies $-r^{\pi'}_a \ge -(r^{\pi'})^T x^\pi / n^2$. Now if policy $\pi''$ uses action $a$, then

$\mathbf{1}^T(v^{\pi'} - v^{\pi''}) \ge v^{\pi'}_s - v^{\pi''}_s = v^{\pi'}_s - (r_a + \gamma P_a^T v^{\pi''}) \ge v^{\pi'}_s - (r_a + \gamma P_a^T v^{\pi'}) = -r^{\pi'}_a \ge \frac{-(r^{\pi'})^T x^\pi}{n^2}$,

using that the values of all states are monotone increasing.

In the second case there is no action $a$ on a path in $\pi$ satisfying $-r^{\pi'}_a x^\pi_a \ge -(r^{\pi'})^T x^\pi / n$. The
remaining portion of $-(r^{\pi'})^T x^\pi$ is due to cycles, so there must be some cycle $C$ consisting of actions
$\{a_1, \ldots, a_k\}$ used in states $\{s_1, \ldots, s_k\}$ such that $\sum_{a \in C} -r^{\pi'}_a x^\pi_a \ge -(r^{\pi'})^T x^\pi / n$. For each $a \in C$,
$x^\pi_a \le n/(1 - \gamma)$, so $-n^2 \sum_{a \in C} r^{\pi'}_a / (1 - \gamma) \ge -(r^{\pi'})^T x^\pi$.

As long as cycle $C$ is intact, each $a \in C$ has $1/(1 - \gamma)$ flux from states in $C$ (Lemma 3.1), so if
$C$ is in policy $\pi''$ then

$-(r^{\pi'})^T x^{\pi''} = \mathbf{1}^T(v^{\pi'} - v^{\pi''}) \ge \sum_{s \in C} (v^{\pi'}_s - v^{\pi''}_s) = \sum_{a \in C} \frac{-r^{\pi'}_a}{1 - \gamma} \ge \frac{-(r^{\pi'})^T x^\pi}{n^2}$. (6)

Now if more than $\log_{n^2/(n^2-1)} n^2$ iterations occur between $\pi$ and $\pi''$, Lemma 3.2 implies

$-(r^{\pi'})^T x^{\pi''} < \left(1 - \frac{1}{n^2}\right)^{\log_{n^2/(n^2-1)} n^2} \left(-(r^{\pi'})^T x^\pi\right) = \frac{-(r^{\pi'})^T x^\pi}{n^2}$.

In the first case action $a$ cannot appear in $\pi''$, and in the second case cycle $C$ must be broken
in $\pi''$. This takes $\log_{n^2/(n^2-1)} n^2 = O(n^2 \log n)$ iterations if no new cycles interrupt the process. □
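The step from the contraction factor $1 - 1/n^2$ to the $O(n^2 \log n)$ iteration count uses $\ln\frac{n^2}{n^2-1} \ge \frac{1}{n^2}$. The short sketch below evaluates the logarithm numerically and compares it with $n^2 \ln n^2$; the function name `contraction_steps` is hypothetical and the computation is floating-point, so it is a sanity check rather than a proof.

```python
import math

def contraction_steps(n):
    """Roughly log_{n^2/(n^2-1)} n^2: contractions by (1 - 1/n^2) needed
    to shrink a gap by a factor of n^2 (rounded up, floating-point)."""
    base = n * n / (n * n - 1)
    return math.ceil(math.log(n * n) / math.log(base))

# Compare with the bound n^2 * ln(n^2) = O(n^2 log n) for a few sizes.
table = [(n, contraction_steps(n), n * n * math.log(n * n)) for n in (2, 5, 10, 50)]
```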

Lemma 3.4. Either the algorithm finishes or a new cycle is created after $O(n^2 m \log n)$ iterations.

Proof. Let $\pi_0$ be a policy after a new cycle is created, and consider the policies $\pi_1, \pi_2, \ldots$ each
separated by $O(n^2 \log n)$ iterations. If no new cycle is created, then by Lemma 3.3 each of these
policies $\pi_i$ has either broken another cycle in $\pi_0$ or contains an action that cannot appear in $\pi_j$
for all $j > i$. There are at most $n$ cycles in $\pi_0$ and at most $m$ actions that can be eliminated, so
after $(m + n) \cdot O(n^2 \log n) = O(n^2 m \log n)$ iterations, the algorithm must terminate or create a new
cycle. □

When a new cycle is formed, the algorithm makes a substantial amount of progress towards the 
optimum but also resets the path optimality above. 

Lemma 3.5. Let $\pi$ and $\pi'$ be subsequent policies such that $\pi'$ creates a new cycle. Then
$r^T(x^* - x^{\pi'}) \le (1 - 1/n)\, r^T(x^* - x^\pi)$.
Proof. Let $\Delta = \max_{a'} r^\pi_{a'}$ and $a = \arg\max_{a'} r^\pi_{a'}$. There is a total of $n/(1 - \gamma)$ flux in the MDP, so
$r^T x^* - r^T x^\pi = (r^\pi)^T x^* \le \Delta n/(1 - \gamma)$. By Lemma 3.1 pivoting on $a$ and creating a cycle will result
in at least $1/(1 - \gamma)$ flux through $a$. Therefore $r^T x^{\pi'} \ge r^T x^\pi + \Delta/(1 - \gamma)$, so

$r^T(x^* - x^{\pi'}) \le r^T(x^* - x^\pi) - \frac{\Delta}{1 - \gamma} \le \left(1 - \frac{1}{n}\right) r^T(x^* - x^\pi)$. □

Lemma 3.6. After $O(n \log n)$ new cycles are created, some action is either eliminated
from cycles or entirely eliminated from policies for the remainder of the algorithm.

Proof. Consider a policy $\pi$ with respect to the optimal gains $r^*$. There is an action $a$ such that
$-r^*_a x^\pi_a \ge -(r^*)^T x^\pi / n$. If $a$ is on a path in $\pi$, then $1 \le x^\pi_a \le n$, so $-r^*_a \ge -(r^*)^T x^\pi / n^2$, and if $a$ is
on a cycle, then $1/(1 - \gamma) \le x^\pi_a \le n/(1 - \gamma)$, so $-r^*_a/(1 - \gamma) \ge -(r^*)^T x^\pi / n^2$.

Since $r^*$ are the gains for the optimal policy, $r^*_{a'} \le 0$ for all $a'$. Therefore if $\pi'$ is any policy
containing $a$, then $-r^*_a \le -r^*_a x^{\pi'}_a \le -(r^*)^T x^{\pi'}$, and if $\pi'$ is any policy containing $a$ on a cycle, then
$-r^*_a/(1 - \gamma) \le -r^*_a x^{\pi'}_a \le -(r^*)^T x^{\pi'}$. Now by Lemma 3.5, if there are more than $\log_{n/(n-1)} n^2 =
O(n \log n)$ new cycles created between policies $\pi$ and $\pi'$ then

$-(r^*)^T x^{\pi'} < \left(1 - \frac{1}{n}\right)^{\log_{n/(n-1)} n^2} \left(-(r^*)^T x^\pi\right) = \frac{-(r^*)^T x^\pi}{n^2}$.

Therefore if $\pi$ contained $a$ on a path, then $a$ cannot appear in any policy after $\pi'$ for the remainder
of the algorithm, and if $\pi$ contained $a$ on a cycle, then $a$ cannot appear in a cycle (but may appear
in a path) after $\pi'$ for the remainder of the algorithm. □

Theorem 3.7. The simplex method converges in at most $O(n^3 m^2 \log^2 n)$ iterations on deterministic
MDPs with uniform discount using the highest gain pivoting rule.

Proof. Consider the policies $\pi_0, \pi_1, \pi_2, \ldots$ where $O(n \log n)$ new cycles have been created between
$\pi_i$ and $\pi_{i+1}$. By Lemma 3.6, each $\pi_i$ contains an action that is either eliminated entirely in $\pi_j$ for
$j > i$ or eliminated from cycles. Each action can be eliminated from cycles and paths, so after $2m$
such rounds of $O(n \log n)$ new cycles the algorithm has converged. By Lemma 3.4 cycles are created
every $O(n^2 m \log n)$ iterations, for a total of $O(n^3 m^2 \log^2 n)$ iterations. □
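Written out, the count in this proof is the product of the three factors it combines. The display below only restates the bounds cited from Lemmas 3.4 and 3.6; the underbrace labels are added here for readability.

```latex
\[
\underbrace{2m}_{\text{elimination rounds}}
\;\cdot\;
\underbrace{O(n \log n)}_{\text{new cycles per round (Lemma 3.6)}}
\;\cdot\;
\underbrace{O(n^{2} m \log n)}_{\text{iterations per new cycle (Lemma 3.4)}}
\;=\; O\!\left(n^{3} m^{2} \log^{2} n\right).
\]
```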



4 Varying Discounts 

In this section we allow each action $a$ to have a distinct discount $\gamma_a$. This significantly complicates
the proof of convergence since the total flux is no longer fixed. When updating a cycle we can no
longer bound the distance to the optimum based solely on the maximum gain, since the optimal
policy may employ actions with smaller gain with respect to the current policy but substantially more flux.

We are able to exhibit a set of layers in which the flux on cycles must lie based on the discount 
of the actions, and we will show that when a cycle is created in a particular layer we make progress 
towards the optimum value for the updated state assuming that it lies within that layer. These 
layers will define a set of bounds whose values we must surpass, which serve as milestones or 
checkpoints to the optimum. When we update a cycle we cannot claim that the overall objective 
increases substantially but only that the values of individual states make progress towards one of 
these milestone values. When the values of all states have surpassed each of these intermediate 
milestones the algorithm will terminate. 
We first define some notation. Recall that to calculate flux we place one unit of "mass" in each
state and then run the Markov chain, so all flux traces back to some state, but $x^\pi$ aggregates all
of it together. Because we will be concerned with analyzing the values of individual states in this
section, it will be useful to separate out the flux originating in a particular state $s$. Consider the
following alternate LP:


maximize $r^T x$
subject to $\forall s' \in S$, $\sum_{a \in A_{s'}} x_a = [s' = s] + \sum_a \gamma_a P_{a,s'} x_a$ (7)
$x \ge 0$

Here $[s' = s]$ is 1 if $s' = s$ and 0 otherwise. The LP (7) is identical to (3), except that initial flux is only added to state $s$ rather than
all states, and the dual of (7) matches (4) if the objective in (4) is changed to minimize only $v_s$.
Feasible solutions in (7) measure only flux originating in $s$ and contributing to $v_s$. For a state $s$ and
policy $\pi$ we use the notation $x^{\pi,s}$ to denote the corresponding vertex in (7). Note that $x^\pi = \sum_s x^{\pi,s}$.



The following lemma is analogous to Lemma 2.5 and has an identical proof: 



Lemma 4.1. For a state $s$ and for policies $\pi$ and $\pi'$, $(r^\pi)^T x^{\pi',s} = r^T x^{\pi',s} - r^T x^{\pi,s} = v^{\pi'}_s - v^\pi_s$.

We now define the intervals in which the flux must lie. As in Section 3, flux on paths is in $[1, n]$.
Let $C$ be a cycle in some policy, and $\gamma_C = \prod_{a \in C} \gamma_a$ be the total discount of $C$. We will prove that the
smallest discount in $C$ determines the rough order of magnitude of the flux through $C$.

Definition 4.2. Let $C$ be a cycle and $a$ an action in $C$, then the discount of $a$ dominates the
discount of $C$ if $\gamma_a \le \gamma_{a'}$ for all $a' \in C$.

Lemma 4.3. Let $\pi$ be a policy containing the cycle $C$ with discount dominated by $\gamma_a$ and total
discount $\gamma_C$. Let $s$ be a state on $C$, $a'$ the action used in $s$ and $a''$ an arbitrary action in $C$, then

• $x^{\pi,s}_{a'} = 1/(1 - \gamma_C)$,

• $\gamma_C/(1 - \gamma_C) \le x^{\pi,s}_{a''} \le 1/(1 - \gamma_C)$, and

• $1/(n(1 - \gamma_a)) \le 1/(1 - \gamma_C) \le 1/(1 - \gamma_a)$.

Proof. For the first equality, all flux originates at $s$, so the flux through $a'$ (used in state $s$) either just
originated in $s$ or came around the cycle from $s$, implying $x^{\pi,s}_{a'} = 1 + \gamma_C x^{\pi,s}_{a'}$. An analogous equation
holds for all other actions $a''$ on $C$, but now the initial flow from $s$ may have been discounted by at
most $\gamma_C$ before reaching $a''$, giving $\gamma_C/(1 - \gamma_C) \le x^{\pi,s}_{a''} \le 1/(1 - \gamma_C)$.

The upper bound in the final inequality, $1/(1 - \gamma_C) \le 1/(1 - \gamma_a)$, holds since $a \in C$ ($\gamma_a$ dominates
the discount of $C$). For the lower bound, let $\epsilon = 1 - \gamma_a$. Then $\gamma_C \ge \gamma_a^n = (1 - \epsilon)^n \ge 1 - n\epsilon =
1 - n(1 - \gamma_a)$, implying $1/(1 - \gamma_C) \ge 1/(n(1 - \gamma_a))$. □
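The three statements of Lemma 4.3 can be checked numerically for a single cycle. In the sketch below, `discounts` lists the per-action discounts around the cycle starting with the action used in $s$; this encoding and the function names are assumptions for illustration. The flux recurrence follows the LP (7): flux through an action is the flux through the preceding action times that action's discount, plus the initial unit at $s$.

```python
def cycle_flux_from(discounts):
    """Flux originating at s through each action of a cycle.

    discounts[0] belongs to the action used in s, discounts[1] to the next
    action on the cycle, and so on (hypothetical encoding for this sketch).
    Returns (gamma_C, list of fluxes).
    """
    gamma_c = 1.0
    for g in discounts:
        gamma_c *= g
    fluxes = []
    reach = 1.0                      # discount accumulated before this action
    for g in discounts:
        fluxes.append(reach / (1 - gamma_c))
        reach *= g
    return gamma_c, fluxes

def lemma_4_3_holds(discounts, n, eps=1e-12):
    """Check the three bounds of Lemma 4.3 for a cycle in an n-state MDP."""
    gamma_c, x = cycle_flux_from(discounts)
    gamma_a = min(discounts)         # the dominating (smallest) discount
    top = 1 / (1 - gamma_c)
    first = abs(x[0] - top) <= eps
    second = all(gamma_c / (1 - gamma_c) - eps <= xi <= top + eps for xi in x)
    third = 1 / (n * (1 - gamma_a)) - eps <= top <= 1 / (1 - gamma_a) + eps
    return first and second and third

gamma_c, x = cycle_flux_from([0.9, 0.5, 0.8])
```

For discounts 0.9, 0.5, 0.8 the total discount is 0.36, so the action used in $s$ carries $1/0.64 = 1.5625$ units of flux from $s$, which lies between $1/(3 \cdot 0.5)$ and $1/0.5$.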

Flux on paths still falls in $[1, n]$, so the algorithm behaves the same on paths as it did in the
uniform case: 

Lemma 4.4. Either the algorithm finishes or a new cycle is created after $O(n^2 m \log n)$ iterations.

Proof. This is identical to the proof of Lemma 3.4, which depends on Lemmas 3.2 and 3.3. Lemma
3.2 holds for nonuniform discounts, and Lemma 3.3 holds after adjusting Equation (6),
using that $\sum_{a \in C} -r^{\pi'}_a/(1 - \gamma_C) \ge -(r^{\pi'})^T x^\pi / n$ and Lemma 4.3. □



Now suppose the simplex method updates the action for state $s$ in policy $\pi$ and creates a cycle
dominated by $\gamma_a$. Again, $v_s$ may not improve much, since there may be a cycle with discount much
larger than $\gamma_a$. However, in any policy $\pi'$ where $s$ is on a cycle dominated by $\gamma_a$ and $s$ uses some
action $a'$, $1/(n(1 - \gamma_a)) \le x^{\pi',s}_{a'} \le 1/(1 - \gamma_a)$, which allows us to argue $v_s$ has made progress towards
the highest value achievable when it is on a cycle dominated by $\gamma_a$, and after enough such progress
has been made, $v_s$ will beat this value and never again appear on any cycle dominated by $\gamma_a$. The optimal
values achievable for each state on a cycle dominated by each $\gamma_a$ serve as the above-mentioned
milestones. Since all cycles are dominated by some $\gamma_a$, there are $m$ milestones per state.

Lemma 4.5. Suppose the simplex method moves from $\pi$ to $\pi'$ by updating the action for state $s$,
creating a new cycle $C$ with discount dominated by $\gamma_a$ for some $a$ in $\pi'$. Let $\pi''$ be the final policy used
by the simplex method in which $s$ is in a cycle dominated by $\gamma_a$. Then $v^{\pi''}_s - v^{\pi'}_s \le (1 - 1/n^2)(v^{\pi''}_s - v^\pi_s)$.

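Proof. Let $\Delta = \max_{a'} r^\pi_{a'}$ be the value of the highest gain with respect to $\pi$. Any cycle contains
at most $n$ actions, each of which has gain at most $\Delta$ in $r^\pi$, so if $s$ is on a cycle dominated by $\gamma_a$
in $\pi''$ then by Lemma 4.3 and Lemma 4.1, $v^{\pi''}_s - v^\pi_s \le n\Delta/(1 - \gamma_a)$, and since $\pi'$ creates a cycle
dominated by $\gamma_a$, by the same lemmas $v^{\pi'}_s \ge v^\pi_s + \Delta/(n(1 - \gamma_a))$. Combining the two,

$v^{\pi''}_s - v^{\pi'}_s \le (v^{\pi''}_s - v^\pi_s) - \frac{\Delta}{n(1 - \gamma_a)} \le \left(1 - \frac{1}{n^2}\right)(v^{\pi''}_s - v^\pi_s)$. □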

The following lemma is the crux of our analysis and allows us to eliminate actions when we get 
close to a milestone value. This occurs because the positive gains must shrink or else the algorithm 
would surpass the milestone, and as the positive gains shrink they can no longer balance larger 
negative gains, forcing such actions out of the cycle. 

Lemma 4.6. Suppose policy $\pi$ contains a cycle $C$ with discount dominated by $\gamma_a$ and $s$ is a state in
$C$. There is some action $a'$ in $C$ (depending on $s$) such that after $O(n^2 \log n)$ iterations that change
the action for $s$ and create a cycle with discount dominated by $\gamma_a$, action $a'$ will never again appear
in a cycle dominated by $\gamma_a$.

Proof. Let $\pi$ be a policy containing a cycle $C$ with discount dominated by $\gamma_a$ and $s$ a state in $C$. Let $\pi'$
be another policy where $s$ is on a cycle dominated by $\gamma_a$ after at least $1 + \log_{n^2/(n^2-1)} n^5 = O(n^2 \log n)$
iterations that create such a cycle by changing the action for $s$, and $\pi''$ the final policy used by the
algorithm in which $s$ is on a cycle dominated by $\gamma_a$.

Consider the policy $\tilde\pi$ in the iteration immediately preceding $\pi'$. By Lemma 4.5 and the choice
of $\pi'$,

$v^{\pi''}_s - v^{\tilde\pi}_s \le \left(1 - \frac{1}{n^2}\right)^{\log_{n^2/(n^2-1)} n^5} (v^{\pi''}_s - v^\pi_s) = \frac{1}{n^5}(v^{\pi''}_s - v^\pi_s)$,

or equivalently $v^\pi_s - v^{\pi''}_s \le -n^5 (v^{\pi''}_s - v^{\tilde\pi}_s)$, implying

$v^\pi_s - v^{\tilde\pi}_s = (v^\pi_s - v^{\pi''}_s) + (v^{\pi''}_s - v^{\tilde\pi}_s) \le (-n^5 + 1)(v^{\pi''}_s - v^{\tilde\pi}_s)$. (8)

Since the gap $v^\pi_s - v^{\tilde\pi}_s$ is large and negative, there must be highly negative gains in $r^{\tilde\pi}$. By Lemma
4.1, $v^\pi_s - v^{\tilde\pi}_s = (r^{\tilde\pi})^T x^{\pi,s}$. Let $r^{\tilde\pi}_{a'} = \min_{a'' \in C} r^{\tilde\pi}_{a''}$ and $s'$ be the state using $a'$. By Lemma
4.3, $x^{\pi,s}_{a'} \le 1/(1 - \gamma_a)$, and $C$ has at most $n$ states, so applying Equation (8),

$\frac{r^{\tilde\pi}_{a'}}{1 - \gamma_a} \le \frac{1}{n}(v^\pi_s - v^{\tilde\pi}_s) \le \left(-n^4 + \frac{1}{n}\right)(v^{\pi''}_s - v^{\tilde\pi}_s)$. (9)

The positive entries in $r^{\tilde\pi}$ must all be small, since there is only a small increase in the value of
$s$. Let $\Delta = \max_{a''} r^{\tilde\pi}_{a''}$. The algorithm pivots on the highest gain, and by assumption it updates the
action for $s$ and creates a cycle dominated by $\gamma_a$. By Lemma 4.3, the new action is used at least
$1/(n(1 - \gamma_a))$ times by flux from $s$, since it is the first action in the cycle, so

$\frac{\Delta}{n(1 - \gamma_a)} \le v^{\pi'}_s - v^{\tilde\pi}_s \le v^{\pi''}_s - v^{\tilde\pi}_s$. (10)

We prove that the highly negative $r^{\tilde\pi}_{a'}$ cannot coexist with only small positive gains bounded by
$\Delta$. Consider any policy in which $s'$ is on a cycle $C'$ containing $a'$ (but not necessarily containing $s$)
with total discount $\gamma_{C'}$ dominated by $\gamma_a$. By Lemma 4.3, there is at least $1/(1 - \gamma_{C'}) \ge 1/(n(1 - \gamma_a))$
flux from $s'$ going through $a'$, and in the rest of the cycle there are at most $n - 1$ other actions with
at most $1/(1 - \gamma_{C'}) \le 1/(1 - \gamma_a)$ flux. The highest gain with respect to $\tilde\pi$ is $\Delta$, so the value of $v_{s'}$
relative to $r^{\tilde\pi}$ is at most

$\frac{r^{\tilde\pi}_{a'}}{n(1 - \gamma_a)} + \frac{(n - 1)\Delta}{1 - \gamma_a} \le \left(-n^3 + \frac{1}{n^2} + n^2\right)(v^{\pi''}_s - v^{\tilde\pi}_s) < 0$,

using Equations (9) and (10). But $v^{\tilde\pi}_{s'} = 0$ relative to $r^{\tilde\pi}$, and it only increases in future iterations,
so $a'$ cannot appear again in a cycle dominated by $\gamma_a$. □

Lemma 4.7. For any action $a$, there are at most $O(n^3 m \log n)$ iterations that create a cycle with
discount dominated by $\gamma_a$.

Proof. After $O(n^3 \log n)$ iterations that create a cycle dominated by $\gamma_a$, some state must have been
updated in $O(n^2 \log n)$ of those iterations, so by Lemma 4.6 some action will never appear again in
a cycle dominated by $\gamma_a$. After $m$ repetitions of this process all actions have been eliminated. □

Theorem 4.8. Simplex terminates in at most $O(n^5 m^3 \log^2 n)$ iterations on deterministic MDPs
with nonuniform discounts using the highest gain pivoting rule.

Proof. There are $O(m)$ possible discounts $\gamma_a$ that can dominate a cycle, and by Lemma 4.7 there
are at most $O(n^3 m \log n)$ iterations creating a cycle dominated by any particular $\gamma_a$, for a total of
$O(n^3 m^2 \log n)$ iterations that create a cycle. By Lemma 4.4 a new cycle is created every $O(n^2 m \log n)$
iterations, for a total of $O(n^5 m^3 \log^2 n)$ iterations overall. □
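As in the uniform case, the total is a product of the counts supplied by the cited lemmas. The display below only restates that arithmetic, with labels added here for readability.

```latex
\[
\underbrace{O(m)}_{\text{dominating discounts } \gamma_a}
\;\cdot\;
\underbrace{O(n^{3} m \log n)}_{\text{cycle-creating iterations per } \gamma_a \text{ (Lemma 4.7)}}
\;\cdot\;
\underbrace{O(n^{2} m \log n)}_{\text{iterations per new cycle (Lemma 4.4)}}
\;=\; O\!\left(n^{5} m^{3} \log^{2} n\right).
\]
```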
5 Open problems



A difficult but natural next step would be to try to extend these techniques to handle policy iteration 
on deterministic MDPs. The main problem encountered is that the multiple simultaneous pivots 
used in policy iteration can interfere with each other in such a way that the algorithm effectively 
pivots on the smallest improving switch rather than the largest. See [HZ10] for such an example.
Another challenging open question is to design a strongly polynomial algorithm for general MDPs. 
Finally, we believe the technique of dividing variable values into polynomial sized layers may be 
helpful for entirely different problems. 